The goal is to understand how travel behavior (e.g., distances traveled for different modes of transportation) varies across different counties through a spatial regression analysis.
This vignette uses a database of 26,095 sample households residing in
California, containing detailed information on their travel behavior on
one assigned day for each household, from April 19, 2016 through April
25, 2017, provided by the Transportation Secure Data Center (TSDC). [1]
Specifically, we obtained travel mode data from
PersonData.Rds and HHData.Rds, and geographical
data, i.e., the physical location and shape of each county,
from counties.shp. All three files are provided in the dataset.
[1] “Transportation Secure Data Center.” (2019). National Renewable Energy Laboratory. Accessed Jan. 15, 2019: www.nrel.gov/tsdc.
Geographically Weighted Regression (GWR) is a spatial analysis technique that extends traditional regression by allowing the relationships between dependent and independent variables to vary spatially. Unlike ordinary least squares (OLS) regression, which assumes global stationarity of the coefficients, GWR incorporates geographic context into the model. This approach accounts for spatial heterogeneity, a common characteristic in spatial datasets, where relationships can change over space due to localized factors.
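In symbols, the contrast between the two models can be written as follows. The notation is our own assumption rather than something defined in the data documentation: $(u_i, v_i)$ are the coordinates of observation $i$, and $x_{ik}$ is the value of predictor $k$ for that observation. OLS estimates one coefficient per predictor, while GWR lets each coefficient be a function of location.

```latex
% Global OLS model: one coefficient per predictor for the whole study area
y_i = \beta_0 + \sum_{k=1}^{p} \beta_k \, x_{ik} + \varepsilon_i

% GWR model: each coefficient depends on the location (u_i, v_i) of observation i
y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i) \, x_{ik} + \varepsilon_i
```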
In GWR, the regression is performed repeatedly for each location in the dataset, weighting observations according to their spatial proximity to the focal location. The weighting is determined using a kernel function, which can be fixed or adaptive, depending on the data’s spatial distribution.[2]
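To make the weighting step concrete, the following is a minimal sketch of a single local regression. It uses a bisquare kernel, one commonly used choice, and synthetic data. The helper name `bisquare_weights`, the bandwidth value, and the simulated points are illustrative assumptions and not part of the analysis in this vignette.

```r
# Bisquare kernel: weight is 1 at distance 0 and falls smoothly to 0 at the bandwidth
bisquare_weights <- function(d, bandwidth) {
  ifelse(d < bandwidth, (1 - (d / bandwidth)^2)^2, 0)
}

# Synthetic points on the unit square (assumption: stand-ins for county centroids)
set.seed(1)
n <- 30
pts <- data.frame(X = runif(n), Y = runif(n), x1 = rnorm(n))
# The true slope drifts from west to east, so local fits should differ by location
pts$y <- 2 + (1 + pts$X) * pts$x1 + rnorm(n, sd = 0.1)

# One local regression: weight every observation by its distance to the focal point
focal <- 1
d <- sqrt((pts$X - pts$X[focal])^2 + (pts$Y - pts$Y[focal])^2)
w <- bisquare_weights(d, bandwidth = 0.75)
local_fit <- lm(y ~ x1, data = pts, weights = w)
print(coef(local_fit))
```

GWR repeats this weighted fit with each observation in turn as the focal point, which is what produces one set of coefficients per location.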
Selecting an appropriate bandwidth is a crucial step in building a GWR model. The bandwidth is the parameter that governs the spatial extent over which neighboring observations influence the estimation of local parameters, so it determines how localized the analysis is.
A bandwidth that is too narrow can make the model oversensitive to local variation and may capture noise in the data. A bandwidth that is too broad produces oversmoothed estimates that mask subtle spatial patterns. A well-chosen bandwidth balances the two, so that the GWR model captures the true spatial heterogeneity without being unduly influenced by distant observations.
Adaptive bandwidths offer an effective solution, as they can vary based on the size of each geographical area and that of its neighbors. Thus, the model can select a narrower bandwidth in dense areas, and a larger one for suburban areas. [3]
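One common way to implement an adaptive bandwidth is to set it, for each location, to the distance of that location's k-th nearest neighbor. The sketch below illustrates the idea on made-up coordinates. The helper `adaptive_bandwidths` is hypothetical, and details such as whether the focal point counts toward k are implementation choices that this sketch does not try to match to any particular package.

```r
# Distance from each point to its k-th nearest neighbor (excluding itself)
adaptive_bandwidths <- function(coords, k) {
  d <- as.matrix(dist(coords))                  # pairwise Euclidean distances
  apply(d, 1, function(row) sort(row)[k + 1])   # position 1 is the point itself (distance 0)
}

# Three clustered points and two isolated ones along a line (toy data)
coords <- cbind(x = c(0, 1, 2, 10, 20), y = c(0, 0, 0, 0, 0))
bw_local <- adaptive_bandwidths(coords, k = 2)
print(unname(bw_local))  # 2 1 2 9 18
```

The clustered points receive small bandwidths and the isolated points receive large ones, while every local regression still draws on the same number of neighbors.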
In our analysis, the `bw.gwr` function selected an optimal adaptive bandwidth of 35, meaning that each local regression draws on the 35 nearest neighboring counties.
[2] Charlton, M., & Fotheringham, A. S. (2009). Geographically weighted regression. [White Paper].
[3] Kiani et al. (2024, February 29). Mastering geographically weighted regression: key considerations for building a robust model. Geospatial Health. https://www.geospatialhealth.net/gh/article/view/1271/1365
Group the data by county (CTFIP) and calculate the average number of miles traveled per person on the survey day for each travel mode (Drive Alone, Drive with Others, Passenger, Walk, and Bike), as well as the average total distance.
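The grouping step can be sketched as follows. This is a minimal illustration on toy records, written in base R to avoid extra dependencies. The input column names (`DriveAlone_Dist`, `Walk_Dist`, `Sum_Pmt`) and the county codes are assumptions inferred from the `avg_` names used later, and only two modes are shown.

```r
# Toy person-level records; everything except the CTFIP column name is hypothetical
person_data <- data.frame(
  CTFIP           = c(6001, 6001, 6037, 6037, 6037),
  DriveAlone_Dist = c(10, 20, 5, 0, 25),
  Walk_Dist       = c(1, 0, 2, 4, 0)
)
# Total distance per person across the modes in this toy example
person_data$Sum_Pmt <- person_data$DriveAlone_Dist + person_data$Walk_Dist

# Average each distance column within each county
summarized_data <- aggregate(
  cbind(DriveAlone_Dist, Walk_Dist, Sum_Pmt) ~ CTFIP,
  data = person_data,
  FUN  = mean
)

# Prefix the averaged columns with "avg_" to match the names used downstream
names(summarized_data)[-1] <- paste0("avg_", names(summarized_data)[-1])
print(summarized_data)
```

The same result can be obtained with dplyr's `group_by()` and `summarise()` if a pipeline style is preferred.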
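library(dplyr)    # %>% and left_join()
library(sf)       # spatial data frames
library(mapview)  # interactive maps
library(leafpop)  # popupTable()

# Attach the county-level averages to the county polygons
shapefile_sum <- shapefile %>%
  left_join(summarized_data, by = "CTFIP")

mapviewOptions(fgb = FALSE)
mapview(shapefile_sum,
        zcol = "avg_Sum_Pmt",  # color each county by its average total distance
        legend = TRUE,
        label = as.character(shapefile_sum$CTFIP),
        popup = popupTable(shapefile_sum,
                           zcol = c("avg_DriveAlone_Dist",
                                    "avg_Driveothers_Dist",
                                    "avg_Passenger_Dist",
                                    "avg_Walk_Dist",
                                    "avg_Bike_Dist",
                                    "avg_Sum_Pmt")))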
Use bw.gwr() to find the optimal bandwidth for GWR
analysis.
library(sp)       # coordinates(), proj4string(), CRS()
library(GWmodel)  # bw.gwr(), gwr.basic()

# Use each county's centroid as its regression location
coords <- st_coordinates(st_centroid(st_geometry(shapefile_sum)))

# Keep only the model variables and drop the polygon geometry
gwr_data <- shapefile_sum %>%
  select(avg_DriveAlone_Dist, avg_Driveothers_Dist, avg_Passenger_Dist,
         avg_Walk_Dist, avg_Bike_Dist, avg_Sum_Pmt) %>%
  st_drop_geometry()
gwr_data <- cbind(gwr_data, coords)

gwr_formula <- avg_Sum_Pmt ~ avg_DriveAlone_Dist + avg_Driveothers_Dist +
  avg_Passenger_Dist + avg_Walk_Dist + avg_Bike_Dist

# Convert to a SpatialPointsDataFrame in longitude/latitude (WGS84)
coordinates(gwr_data) <- ~X + Y
proj4string(gwr_data) <- CRS("+proj=longlat +datum=WGS84")

# Perform bandwidth selection for GWR
bw <- bw.gwr(
  formula = gwr_formula,
  data = gwr_data,
  adaptive = TRUE)
## Adaptive bandwidth: 43 CV score: 36.66803
## Adaptive bandwidth: 35 CV score: 36.27212
## Adaptive bandwidth: 28 CV score: 40.56964
## Adaptive bandwidth: 37 CV score: 37.88228
## Adaptive bandwidth: 31 CV score: 40.03792
## Adaptive bandwidth: 34 CV score: 36.77004
## Adaptive bandwidth: 31 CV score: 40.03792
## Adaptive bandwidth: 32 CV score: 38.86713
## Adaptive bandwidth: 31 CV score: 40.03792
## Adaptive bandwidth: 31 CV score: 40.03792
## Adaptive bandwidth: 30 CV score: 40.67171
## Adaptive bandwidth: 30 CV score: 40.67171
## Adaptive bandwidth: 29 CV score: 41.34116
## Adaptive bandwidth: 29 CV score: 41.34116
## Adaptive bandwidth: 28 CV score: 40.56964
## Adaptive bandwidth: 28 CV score: 40.56964
## Adaptive bandwidth: 27 CV score: 40.22093
## Adaptive bandwidth: 27 CV score: 40.22093
## Adaptive bandwidth: 26 CV score: 40.56296
## Adaptive bandwidth: 26 CV score: 40.56296
## Adaptive bandwidth: 25 CV score: 40.20452
## Adaptive bandwidth: 25 CV score: 40.20452
## Adaptive bandwidth: 24 CV score: 39.88842
## Adaptive bandwidth: 24 CV score: 39.88842
## Adaptive bandwidth: 23 CV score: 40.12316
## Adaptive bandwidth: 23 CV score: 40.12316
## Adaptive bandwidth: 22 CV score: 40.45429
## Adaptive bandwidth: 22 CV score: 40.45429
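The search trace above can be read as a table of candidate bandwidths and their cross-validation (CV) scores, with the lowest score identifying the selected bandwidth. As a small check, the sketch below copies the distinct bandwidth and score pairs from the printed output into a data frame and picks the minimum. The object names `cv_trace` and `best` are our own.

```r
# Distinct (bandwidth, CV score) pairs copied from the bw.gwr() trace above
cv_trace <- data.frame(
  bandwidth = c(43, 35, 28, 37, 34, 32, 31, 30, 29, 27, 26, 25, 24, 23, 22),
  cv = c(36.66803, 36.27212, 40.56964, 37.88228, 36.77004, 38.86713,
         40.03792, 40.67171, 41.34116, 40.22093, 40.56296, 40.20452,
         39.88842, 40.12316, 40.45429)
)

# The bandwidth with the lowest CV score is the one carried into the model
best <- cv_trace[which.min(cv_trace$cv), ]
print(best)
```

Among the candidates evaluated, a bandwidth of 35 has the lowest CV score (36.27212).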
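# Fit the GWR model with the selected adaptive bandwidth
gwr_model <- gwr.basic(
  formula = gwr_formula,
  data = gwr_data,
  bw = bw,
  adaptive = TRUE)

# SDF holds the local coefficients, standard errors, and t-values per county
gwr_result <- gwr_model$SDF
summary(gwr_result)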
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## X -123.89432 -115.36552
## Y 33.03547 41.74308
## Is projected: FALSE
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Number of points: 58
## Data attributes:
## Intercept avg_DriveAlone_Dist avg_Driveothers_Dist avg_Passenger_Dist
## Min. :-0.4934 Min. :0.9145 Min. :0.8048 Min. :0.8516
## 1st Qu.: 0.4961 1st Qu.:0.9518 1st Qu.:0.9750 1st Qu.:0.8755
## Median : 0.8308 Median :0.9979 Median :1.0536 Median :0.9178
## Mean : 0.9485 Mean :1.0118 Mean :1.0237 Mean :0.9157
## 3rd Qu.: 1.4073 3rd Qu.:1.0629 3rd Qu.:1.0721 3rd Qu.:0.9444
## Max. : 2.5491 Max. :1.1466 Max. :1.1642 Max. :1.0113
## avg_Walk_Dist avg_Bike_Dist y yhat
## Min. :-0.9423 Min. :1.616 Min. :18.13 Min. :19.12
## 1st Qu.: 1.6040 1st Qu.:2.339 1st Qu.:25.74 1st Qu.:25.73
## Median : 4.0746 Median :2.797 Median :28.15 Median :28.10
## Mean : 3.5652 Mean :2.692 Mean :28.95 Mean :28.98
## 3rd Qu.: 5.6299 3rd Qu.:3.087 3rd Qu.:31.11 3rd Qu.:31.31
## Max. : 6.9179 Max. :3.916 Max. :40.88 Max. :40.47
## residual CV_Score Stud_residual Intercept_SE
## Min. :-1.035504 Min. :0 Min. :-2.66169 Min. :0.6972
## 1st Qu.:-0.321663 1st Qu.:0 1st Qu.:-0.75237 1st Qu.:0.8132
## Median : 0.005585 Median :0 Median : 0.01299 Median :0.9091
## Mean :-0.029618 Mean :0 Mean :-0.11165 Mean :0.9201
## 3rd Qu.: 0.244794 3rd Qu.:0 3rd Qu.: 0.49366 3rd Qu.:1.0281
## Max. : 1.141008 Max. :0 Max. : 3.06768 Max. :1.2451
## avg_DriveAlone_Dist_SE avg_Driveothers_Dist_SE avg_Passenger_Dist_SE
## Min. :0.04545 Min. :0.07797 Min. :0.07302
## 1st Qu.:0.04917 1st Qu.:0.08661 1st Qu.:0.08168
## Median :0.05427 Median :0.09965 Median :0.09066
## Mean :0.05475 Mean :0.09926 Mean :0.09005
## 3rd Qu.:0.05991 3rd Qu.:0.10985 3rd Qu.:0.09689
## Max. :0.07354 Max. :0.12553 Max. :0.11771
## avg_Walk_Dist_SE avg_Bike_Dist_SE Intercept_TV avg_DriveAlone_Dist_TV
## Min. :0.8861 Min. :1.050 Min. :-0.6075 Min. :12.72
## 1st Qu.:1.0182 1st Qu.:1.122 1st Qu.: 0.5059 1st Qu.:17.83
## Median :1.1827 Median :1.170 Median : 0.9482 Median :19.25
## Mean :1.4982 Mean :1.234 Mean : 0.9975 Mean :18.72
## 3rd Qu.:1.5748 3rd Qu.:1.337 3rd Qu.: 1.6638 3rd Qu.:20.28
## Max. :3.3180 Max. :1.605 Max. : 2.6042 Max. :21.74
## avg_Driveothers_Dist_TV avg_Passenger_Dist_TV avg_Walk_Dist_TV
## Min. : 6.411 Min. : 7.903 Min. :-0.6299
## 1st Qu.: 9.098 1st Qu.: 9.168 1st Qu.: 1.0900
## Median :10.650 Median :10.414 Median : 2.0566
## Mean :10.565 Mean :10.306 Mean : 2.9079
## 3rd Qu.:12.364 3rd Qu.:11.350 3rd Qu.: 5.5485
## Max. :13.466 Max. :12.864 Max. : 6.4995
## avg_Bike_Dist_TV Local_R2
## Min. :1.107 Min. :0.9879
## 1st Qu.:1.860 1st Qu.:0.9922
## Median :2.414 Median :0.9937
## Mean :2.236 Mean :0.9937
## 3rd Qu.:2.723 3rd Qu.:0.9952
## Max. :3.160 Max. :0.9978
head(gwr_result)
## class : SpatialPointsDataFrame
## features : 6
## extent : -122.8884, -118.8004, 35.38639, 38.52794 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## variables : 24
## names : Intercept, avg_DriveAlone_Dist, avg_Driveothers_Dist, avg_Passenger_Dist, avg_Walk_Dist, avg_Bike_Dist, y, yhat, residual, CV_Score, Stud_residual, Intercept_SE, avg_DriveAlone_Dist_SE, avg_Driveothers_Dist_SE, avg_Passenger_Dist_SE, ...
## min values : -0.493407547943922, 0.931195113605572, 0.804772664961579, 0.882009855991374, 2.07589095658566, 1.93713278106003, 25.5660966027259, 26.1724569854574, -0.780451312708898, 0, -2.66168916076884, 0.812249140951841, 0.0454497445608676, 0.0779715804092984, 0.0830799886053281, ...
## max values : 2.15676300222343, 1.14663162009081, 1.07245452213885, 1.0113345223337, 6.82075707909126, 3.53063802804501, 38.0691114580212, 38.8495627707301, 0.231925827520939, 0, 0.451298627278192, 1.1574966922564, 0.0668625157233102, 0.125534232867506, 0.110703232342893, ...
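The `_SE` and `_TV` columns in the output hold the local standard errors and t-values of each coefficient. One rough way to read them is to count how often a local t-value exceeds about 1.96 in absolute value. The sketch below does this on toy values that are illustrative only and are not rows of the actual result. The 1.96 cutoff is our assumption of a normal approximation at the 5% level, and it ignores the multiple-testing issue that arises when many local tests are run.

```r
# Toy stand-in for two *_TV columns of gwr_result (hypothetical values)
tv <- data.frame(
  avg_Walk_Dist_TV = c(-0.6, 1.1, 2.1, 5.5),
  avg_Bike_Dist_TV = c(1.1, 1.9, 2.4, 3.2)
)

# Flag local coefficients whose |t| exceeds the rough cutoff
significant <- abs(as.matrix(tv)) > 1.96

# Share of locations where each coefficient is flagged
share_significant <- colMeans(significant)
print(share_significant)
```

With the real model output, the same comparison could be applied to columns such as `gwr_result$avg_Walk_Dist_TV`.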
library(ggplot2)

# Convert the GWR result back to an sf object for plotting
gwr_sf <- st_as_sf(gwr_result)

# Note: `boundaries` is not defined in this excerpt; it is assumed to be
# the county polygon layer used to draw outlines behind the points.

# Local coefficient of drive-alone distance
ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_DriveAlone_Dist)) +
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +
  theme_minimal() +
  labs(title = "Spatial Variation of avg_DriveAlone_Dist Coefficient",
       color = "Coefficient Value")

# Local coefficient of drive-with-others distance
ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Driveothers_Dist)) +
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Driveothers_Dist Coefficient",
       color = "Coefficient Value")

# Local coefficient of passenger distance
ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Passenger_Dist)) +
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Passenger_Dist Coefficient",
       color = "Coefficient Value")

# Local coefficient of walking distance
ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Walk_Dist)) +
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Walk_Dist Coefficient",
       color = "Coefficient Value")

# Local coefficient of bike distance
ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Bike_Dist)) +
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Bike_Dist Coefficient",
       color = "Coefficient Value")
For walking distance (avg_Walk_Dist), coastal areas, especially below 38°N, have noticeably higher coefficients (yellow to green), suggesting that walking distance contributes more strongly to changes in total travel distance there.
For drive-alone distance (avg_DriveAlone_Dist), coefficients are higher around 38°N, 123°W (the Bay Area), indicating that average drive-alone distance contributes more strongly to changes in total travel distance in that region.
For drive-with-others distance (avg_Driveothers_Dist), coefficients are higher in the southern and northern regions (green), while some coastal areas near 38°N have lower coefficients (dark blue), roughly the reverse of the drive-alone pattern.
For passenger distance (avg_Passenger_Dist), coefficients are generally low (mostly dark blue) across all regions, particularly in southern areas, suggesting that passenger distance is less influential.
For bike distance (avg_Bike_Dist), coefficients are lower in the southern regions (dark purple), suggesting that bike distance is less influential there.
Driving Alone: Higher influence in the northern regions.
Driving with Others: Higher influence in central and southern regions.
Walking: Stronger impacts in coastal regions.
Cycling: Stronger impacts in northern and central regions.
Passenger: Riding as a passenger is relatively consistent across regions.